After the very succesful impact evaluations you have performed in the past weeks, you are contacted by the local government of Pawnee, Indiana. The city is interested in your advice to assess a policy intervention passed with the support of councilwoman Leslie Knope.
The city of Pawnee has been at the spotlight recently, as it has come to be known as the child obesity and diabetes capital of the state of Indiana. Some of the constituents of the city point at the fast food culture and soda sizes across the restaurants in town as a source of the problem. The largest food chain in Pawnee, Paunch Burger, offers its smallest soda size at a whopping 64oz (about 1.9 liters).
The “soda tax”, as it came to be known, came to effect initially at a couple of districts. Fortunately for you, based on an archaic law, residents of Indiana have to demonstrate their residence in the district they intend to dine before being served at any of the restaurants. The latter fact means that Pawnee inhabitants can only buy sugar-added drinks in their respective home district.
set.seed(42) #for consistent results
library(dplyr) # to wrangle our data
library(tidyr) # to wrangle our data - gather()
library(ggplot2) # to render our graphs
library(haven) # to load our .dta files for the assignment
library(plm) # to run our fixed effects models
library(kableExtra) # to render better formatted tables
library(stargazer) # for formatting your model output
soda_tax_df <- read.csv("./data/soda_tax_df.csv") # simulated data
names(soda_tax_df) # to check the names of the variables in our data
## [1] "id" "district" "treatment" "pre_tax" "post_tax" "change"
Our dataset soda_tax_df, contains the following information:
ìd: A unique number identifier for each of the 7,500 inhabitants of Pawneedistrict: The name of the district in which the corresponding unit livestreatment: A binary variable that signals whether the subject lived in a district where the tax was implementedpre_tax: The weekly sugar-added drink consumption in ounces before the tax was imposedpost_tax: The weekly sugar-added drink consuption in ounces after the tax was imposedchange: The difference in sugar-added drink consuption between post- and pre-taxWe can use the base R table() function, which performs categorical tabulation of data with the variable and its frequency, to check how many people live on each district.
table(soda_tax_df$district) %>%
kable() %>% # create kable table
kable_styling() # view kable table
| Var1 | Freq |
|---|---|
| City Hall | 1250 |
| Old Eagleton | 1250 |
| River Park | 1250 |
| Snake Lounge | 1250 |
| The Pit | 1250 |
| Wamapoke Quarter | 1250 |
Even more, we can create a two-way cross-table to check how many treated units live in each district.
table(soda_tax_df$treatment, soda_tax_df$district) %>%
kable() %>% # create kable table
kable_styling() # view kable table
| City Hall | Old Eagleton | River Park | Snake Lounge | The Pit | Wamapoke Quarter | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1250 | 1250 | 1250 |
| 1 | 1250 | 1250 | 1250 | 0 | 0 | 0 |
We can also utilize summary() to check the distribution of our variables. In this case, we can look at the distribution of our variable of interest at the two points in time:
summary(soda_tax_df$pre_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 400.6 506.6 509.7 610.8 1687.6
summary(soda_tax_df$post_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 327.0 445.9 449.8 566.1 1705.6
So far we have ignored time in our estimations. Up until this point, most of the tools we have learnt produce estimates of the counterfactual through explicit assignment rules that work random or as-if-random (e.g. randomized experimental, regression discountinuity, and instrumental set-ups).
Difference-in-differences compares the changes in outcomes over time between units under different treatment states. This allows us to correct for any differences between the treatment and comparison groups that are constant over time assuming that the trends in time are parallel.
pre_tax baseline measure, we would likely utilize the post_tax to explore the average effect on the treated. In this case, we would model this as:
after_model <- lm(post_tax ~ treatment, data = soda_tax_df)
stargazer(after_model, type = "html")
| Dependent variable: | |
| post_tax | |
| treatment | -146.918*** |
| (3.798) | |
| Constant | 523.273*** |
| (2.686) | |
| Observations | 7,500 |
| R2 | 0.166 |
| Adjusted R2 | 0.166 |
| Residual Std. Error | 164.465 (df = 7498) |
| F Statistic | 1,496.245*** (df = 1; 7498) |
| Note: | p<0.1; p<0.05; p<0.01 |
We could read this result substantively as: those who lived in districts were the tax was implemented consumed on average 146.9 ounces less of sugar-added drinks per week compared to those who lived in districts were the tax was not put in place. This calculation would give us a comparison of the treatment and control groups after treatment.
To believe the results of our after_model, we would need to believe that the mean ignorability of treatment assignment assumption is fulfilled. We would have to think carefully about possible factors that could differentiate our treatment and control groups. We use a treatment indicator based on the districts where the measure was able to be implemented. Treatment was not fully randomly assigned, so there may be lots of potential confounders that create baseline differences in the scores for those living in Old Eagleton compared to those in Snake Lounge, which also affect the after-treatment comparisons.
We can introduce the time component to our calculation by incorporating the pre-treatment levels of sugar-added drink consumption, which gives us the diff-in-diff estimand. We could calculate this in a fairly straightforward manner:
did_model <- lm(change ~ treatment, data = soda_tax_df)
stargazer(did_model, type = "html")
| Dependent variable: | |
| change | |
| treatment | -149.744*** |
| (0.246) | |
| Constant | 14.967*** |
| (0.174) | |
| Observations | 7,500 |
| R2 | 0.980 |
| Adjusted R2 | 0.980 |
| Residual Std. Error | 10.671 (df = 7498) |
| F Statistic | 369,242.400*** (df = 1; 7498) |
| Note: | p<0.1; p<0.05; p<0.01 |
We could read this result substantively as: those who lived in districts were the tax was implemented consumed on average 149.7 ounces less of sugar-added drinks per week compared to those who lived in districts were the tax was not put in place. This calculation would give us the change, or difference, in sugar-added drink consumption for treatment and control groups.
To believe the results of our did_model, we would need to believe that there are parallel trends between the two groups.
Note: when simulating the data the post_tax was defined as: \(post_tax = 15 + pre\_tax - 150(treatment) + error\)
\[Y_{it} = β_0 + β_1D^*_i + β_2P_t + β_{DD}D^∗_i × P_t + q_{it} \]
We need to convert our data to a long format to render the time and treatment dummy variables. This is how our data looks like:
head(soda_tax_df, 6)
## id district treatment pre_tax post_tax change
## 1 1 Snake Lounge 0 1687.6438 1705.5791 17.935292
## 2 2 Snake Lounge 0 427.2953 438.2526 10.957361
## 3 3 Snake Lounge 0 566.4693 559.6664 -6.802863
## 4 4 Snake Lounge 0 606.9294 623.9057 16.976316
## 5 5 Snake Lounge 0 572.6402 606.8654 34.225177
## 6 6 Snake Lounge 0 496.0813 501.8001 5.718800
We will utilize the pivot_longer() function from tidyr to format our data frame.
soda_tax_df_long <- soda_tax_df %>%
select(id, pre_tax, post_tax, treatment) %>% # select the columns we are interested in
pivot_longer(cols = c(pre_tax, post_tax), names_to = "period", values_to = "soda_drank") %>% # grab columns, put names to new variable period and values to new variable soda_drank
mutate(after_tax = if_else(period == "post_tax", 1, 0)) # create dummy for period
head(soda_tax_df_long, 6)
## # A tibble: 6 x 5
## id treatment period soda_drank after_tax
## <int> <int> <chr> <dbl> <dbl>
## 1 1 0 pre_tax 1688. 0
## 2 1 0 post_tax 1706. 1
## 3 2 0 pre_tax 427. 0
## 4 2 0 post_tax 438. 1
## 5 3 0 pre_tax 566. 0
## 6 3 0 post_tax 560. 1
We can see that under our long format, we have two entries for every individual. We have our variable after_tax, which represents \(P_t\), where 0 and 1 are pre- and post-tax periods respectively. We can now render our regression based on the formula:
did_long <- lm(soda_drank ~ treatment + after_tax + treatment * after_tax, data = soda_tax_df_long)
We can think about the results of this regression in terms of this table, where \(\beta{DD}\) is the difference in differences in group means:
stargazer(did_long, type = "html")
| Dependent variable: | |
| soda_drank | |
| treatment | 2.827 |
| (3.799) | |
| after_tax | 14.967*** |
| (3.799) | |
| treatment:after_tax | -149.744*** |
| (5.373) | |
| Constant | 508.306*** |
| (2.686) | |
| Observations | 15,000 |
| R2 | 0.117 |
| Adjusted R2 | 0.117 |
| Residual Std. Error | 164.502 (df = 14996) |
| F Statistic | 664.480*** (df = 3; 14996) |
| Note: | p<0.1; p<0.05; p<0.01 |
If we go back to look our did_model, we can notice that by regressing the results we got at that stage were our \(\beta_2\) and \(\beta{DD}\).
Your evaluation report is so convincing that the Director of the Parks Department, Ron Swanson, is even doubting libertarianism.
You are hired as an outside consultant by the Organization of Economic Non-Cooperation for Development (OENCD), they are interested in studying the effect of a carbon tax on national carbon dioxide emissions per capita. You are provided with data for the twenty-members of the organization from 2009 to 2019. The data can be called a true-panel based on the description given in the lecture
carbon_tax_df <- read.csv("./data/carbon_tax_df.csv") # simulated data
names(carbon_tax_df) # to check the names of the variables in our data
## [1] "country_name" "country_code" "year" "tax"
## [5] "income_class" "co2_per_capita"
Our dataset carbon_tax_df, contains the following information:
country_name: Name of the countrycountry_code: Three-letter country codeyear: Yeartax: Dummy for whether the carbon tax was in placeincome_class: Categorical variable with income label (Low to High)co2_per_capita: carbon dioxide emissions per capita in metric tons (T)
We can use what we have learnt about the base R table() function, to check how many countries had a tax in place every year.
table(carbon_tax_df$tax, carbon_tax_df$year) %>%
kable() %>% # create kable table
kable_styling() # view kable table
| 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 12 | 10 | 9 | 8 | 8 | 5 | 4 | 4 | 4 | 5 |
| 1 | 0 | 0 | 2 | 3 | 4 | 4 | 7 | 8 | 8 | 8 | 7 |
ggplot(carbon_tax_df, aes(x = year,
y= co2_per_capita,
color = factor(tax))) +
geom_point() + #create scatterplot
theme_minimal() +
theme(legend.position="bottom") + #move legend to the bottom
labs(title = "Exploratory plot of CO2 emissions per capita",
x = "Year",
y = "CO2 emissions in metric tons (T)",
color = "Carbon tax") +
scale_colour_discrete(labels = c("No", "In place")) #change labels of the legend
Note: The exaploratoring our data portions of this lab script will not be directly applicable to this week’s assignment. Still, exploratory data analysis will be very useful for the extension portion of your final replication paper. Summarizing, graphing, and exploring your data will be critical to discover patterns, to spot anomalies, and to check assumptions
In the example at hand, we can think of unit-specific factors as characteristics of individual countries that are constant over time (e.g. a country that just loves big pick-up trucks). We can also think about time-specific factors that affect all countries (e.g. global economic shocks).
We can formulate this as:
\[Y_{it} = β_0 + β_1D_it + \nu_{i} + θ_t + η_{it}\] where \(\nu_i\) reflects the time-invariant traits of the units \(\theta_t\) reflects the time-specific factors that affect everyone and \(\eta_{it}\) is the idiosyncratic error
We will move forward by creating three models:
co2_per_capita on tax.
naive_carbon <- lm(co2_per_capita ~ tax, data = carbon_tax_df)
stargazer(naive_carbon, type = "html")
| Dependent variable: | |
| co2_per_capita | |
| tax | -6.261*** |
| (0.566) | |
| Constant | 10.627*** |
| (0.352) | |
| Observations | 132 |
| R2 | 0.485 |
| Adjusted R2 | 0.481 |
| Residual Std. Error | 3.166 (df = 130) |
| F Statistic | 122.394*** (df = 1; 130) |
| Note: | p<0.1; p<0.05; p<0.01 |
This model is telling us that on average, the \(CO_2\) emmissions per capita are reduced by 6.2 metric tons when a carbon tax is put in place. Still, after all the work we have done throughout the semester, we understand that there may be a plethora of factors that could be skewing the results of this bivarate regression.
We will learn two ways of gathering unit- and time-fixed effects in R. First, we will perform Least Squares Dummy Variables (LSDV) estimation with lm(), where we essentially get an individual estimate for each unit. Second, we will run our model with plm(), which will do the same mechanics, yet it will not render each of the units intercept.
lsdv_unit_fe <- lm(co2_per_capita ~ tax + country_name, data = carbon_tax_df)
unit_fe <- plm(co2_per_capita ~ tax, data = carbon_tax_df, index = c("country_name"), model = "within")
stargazer(lsdv_unit_fe, type = "html")
| Dependent variable: | |
| co2_per_capita | |
| tax | -4.436*** |
| (0.474) | |
| country_nameBorovia | 1.507 |
| (0.912) | |
| country_nameCarpania | 5.106*** |
| (0.912) | |
| country_nameCorinthia | 6.434*** |
| (0.887) | |
| country_nameFreedonia | 6.120*** |
| (0.903) | |
| country_nameGlenraven | 0.372 |
| (0.951) | |
| country_nameKhemed | -0.672 |
| (0.951) | |
| country_nameLaurania | 1.533 |
| (0.937) | |
| country_nameParano | 3.693*** |
| (0.887) | |
| country_nameRon | 2.715*** |
| (0.912) | |
| country_nameRumekistan | 7.431*** |
| (0.887) | |
| country_nameTransia | 1.883* |
| (0.968) | |
| Constant | 6.912*** |
| (0.627) | |
| Observations | 132 |
| R2 | 0.797 |
| Adjusted R2 | 0.776 |
| Residual Std. Error | 2.079 (df = 119) |
| F Statistic | 38.844*** (df = 12; 119) |
| Note: | p<0.1; p<0.05; p<0.01 |
stargazer(unit_fe, type = "html")
| Dependent variable: | |
| co2_per_capita | |
| tax | -4.436*** |
| (0.474) | |
| Observations | 132 |
| R2 | 0.424 |
| Adjusted R2 | 0.366 |
| F Statistic | 87.716*** (df = 1; 119) |
| Note: | p<0.1; p<0.05; p<0.01 |
Adding unit-level fixed effects to the model, i.e. accounting for unobserved, time-invariant characteristics of countries and only focusing on within-state variation, an increase the imposition of a carbon tax reduces \(CO_2\) per capita emissions by 4.44 metric tons. Once we have captured the variation amongst countries, we can see that our results from the naive model were substantially biased. We can still try to capture the time-specific portion of the error.
The results from the Least Squares Dummy Variables (LSDV) estimation are read in reference to a baseline. In this case, the constant is representing the intercept for Adjikistan. We can utilize the individual slopes for each country to say that Freedonians emit on average 5.93 more metric tons of \(CO_2\) per capita than Adjikistanians.
We will perform our regressions with Least Squares Dummy Variables (LSDV) estimation with lm() and our simplified way with plm().
lsdv_unit_time_fe <- lm(co2_per_capita ~ tax + country_name + factor(year), data = carbon_tax_df)
unit_time_fe <- plm(co2_per_capita ~ tax, data = carbon_tax_df, index = c("country_name", "year"), model = "within", effect = "twoways")
stargazer(lsdv_unit_time_fe, type = "html")
| Dependent variable: | |
| co2_per_capita | |
| tax | -3.908*** |
| (0.280) | |
| country_nameBorovia | 1.267*** |
| (0.418) | |
| country_nameCarpania | 4.866*** |
| (0.418) | |
| country_nameCorinthia | 6.434*** |
| (0.398) | |
| country_nameFreedonia | 5.928*** |
| (0.410) | |
| country_nameGlenraven | -0.012 |
| (0.447) | |
| country_nameKhemed | -1.056** |
| (0.447) | |
| country_nameLaurania | 1.196*** |
| (0.436) | |
| country_nameParano | 3.693*** |
| (0.398) | |
| country_nameRon | 2.474*** |
| (0.418) | |
| country_nameRumekistan | 7.431*** |
| (0.398) | |
| country_nameTransia | 1.450*** |
| (0.459) | |
| factor(year)2010 | -4.697*** |
| (0.381) | |
| factor(year)2011 | 1.890*** |
| (0.384) | |
| factor(year)2012 | -3.061*** |
| (0.387) | |
| factor(year)2013 | 0.106 |
| (0.392) | |
| factor(year)2014 | -1.878*** |
| (0.392) | |
| factor(year)2015 | -3.018*** |
| (0.414) | |
| factor(year)2016 | -3.292*** |
| (0.424) | |
| factor(year)2017 | -0.963** |
| (0.424) | |
| factor(year)2018 | -2.488*** |
| (0.424) | |
| factor(year)2019 | -1.399*** |
| (0.414) | |
| Constant | 8.621*** |
| (0.396) | |
| Observations | 132 |
| R2 | 0.963 |
| Adjusted R2 | 0.955 |
| Residual Std. Error | 0.933 (df = 109) |
| F Statistic | 127.305*** (df = 22; 109) |
| Note: | p<0.1; p<0.05; p<0.01 |
stargazer(unit_time_fe, type = "html")
| Dependent variable: | |
| co2_per_capita | |
| tax | -3.908*** |
| (0.280) | |
| Observations | 132 |
| R2 | 0.641 |
| Adjusted R2 | 0.568 |
| F Statistic | 194.229*** (df = 1; 109) |
| Note: | p<0.1; p<0.05; p<0.01 |
Now in addition to adding unit-level fixed effects to the model, we control for time-specific factors that affect the global per capita \(CO_2\) emissions. The results suggest that the effect of a carbon-tax leads to a decrease \(CO_2\) emissions of 3.91 metric tons per capita.
Your results are welcomed binternationally and all states move forward with the measure.